Abstract

This paper addresses the problem of semiparametric efficiency bounds for conditional
moment restriction models with different conditioning variables. We characterize such
an efficiency bound, which in general is not explicit, as a limit of explicit
efficiency bounds for a decreasing sequence of unconditional (marginal) moment
restriction models. An iterative procedure for approximating the efficient score when
it is not explicit is provided. Our theoretical results complete and extend existing
results in the literature, provide new insight into the theory of semiparametric
efficiency bounds and open the door to new applications. In particular, we
investigate a class of regression-like (mean regression, quantile regression, ...) models
with missing data.

1 The model 

Conditional moment restriction models represent a large class of statistical models.
Seemingly unrelated nonlinear regressions, see Gallant (1975), Müller (2009), seemingly
unrelated quantile regressions, see Jun and Pinkse (2009), and regression models with
missing data, see Robins, Rotnitzky and Zhao (1994), Tsiatis (2006), are only a few
examples and related contributions. Ai and Chen (2009) and Hansen (2007) provide many
other references and examples of econometric models that can be stated as conditional
moment restriction models.

In this paper we address the problem of calculating semiparametric efficiency bounds 
in models defined by several conditional moment restrictions with possibly different con- 
ditioning variables. More formally, the sample under study consists of independent copies 



* CREST, Ecole Nationale de la Statistique et de l'Analyse de l'Information (Ensai), Campus de
Ker-Lann, rue Blaise Pascal, BP 37203, 35172 Bruz cedex, France. Authors' emails: hristach@ensai.fr,
patilea@ensai.fr






of a random vector $Z \in \mathcal{Z} \subset \mathbb{R}^{q}$. Let $J$ be some positive integer that is fixed in the
following. For any $j \in \{1, \dots, J\}$, let $X^{(j)}$ be a random $q_j$-dimensional subvector of $Z$,
where $0 \le q_j < q$. Let $g_j : \mathbb{R}^{q} \times \mathbb{R}^{d} \to \mathbb{R}^{p_j}$, $j \in \{1, \dots, J\}$, denote given functions of $Z$ and
the unknown parameter $\theta \in \Theta \subset \mathbb{R}^{d}$. The semiparametric model we consider is defined
by the conditional moment restrictions

\[ E[g_j(Z, \theta) \mid X^{(j)}] = 0, \quad j = 1, \dots, J, \quad \text{almost surely.} \tag{1} \]

It is assumed that the $d$-dimensional parameter $\theta$ is identified by the conditional
restrictions, which means there exists a unique value $\theta_0$ such that the true law of $Z$
satisfies equations (1). By definition, $X^{(j)}$ is a constant random variable when $q_j = 0$, and
hence the conditional expectation given $X^{(j)}$ is the marginal expectation.

Particular cases of this model have been extensively studied in the literature. For
$J = 1$ and $q_1 = 0$ we obtain a model defined by an unconditional set of moment equations

\[ E[g(Z, \theta)] = 0. \]

Hansen (1982) considered the class of GMM estimators and showed how to construct an
optimal one in this class. Its asymptotic variance equals the semiparametric efficiency
bound obtained by Chamberlain (1987).

The GMM method extends naturally to models defined by conditional moment equations,
corresponding to the case $J = 1$ and $q_1 > 0$ in our setting, that is

\[ E[g(Z, \theta) \mid X] = 0. \]

From a mathematical point of view, such a model is equivalent to the intersection of the
models of the form

\[ E[a(X)\, g(Z, \theta)] = 0, \]

where $a(X)$ is an arbitrary conformable random matrix whose entries are square
integrable. Following the econometric literature, $a(X)$ is referred to as a matrix of
instruments. The supremum of the information on $\theta_0$ in these models yields the semiparametric
Fisher information on $\theta_0$ in the conditional equation model, obtained by Chamberlain
(1992a). It is also the information on $\theta_0$ for the unconditional moment equation

\[ E[a^*(X)\, g(Z, \theta)] = 0, \]

with properly chosen 'optimal' instruments $a^*(X)$.
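
As a worked illustration of the role of the instruments (a standard textbook case, not one of the paper's own examples), consider the scalar restriction $g(Z, \theta) = Y - \theta X$ with $E[Y - \theta_0 X \mid X] = 0$ and a conditional variance $\sigma^2(X) = \mathrm{Var}(Y \mid X)$ that is assumed, for this sketch only, to be positive and such that all the expectations below are finite. For a scalar instrument $a(X)$, the information on $\theta_0$ in the single unconditional moment equation $E[a(X)(Y - \theta X)] = 0$ is

\[ I_{\theta_0}(a) = \frac{\left(E[a(X)\, X]\right)^2}{E[a(X)^2\, \sigma^2(X)]}, \]

and the Cauchy-Schwarz inequality applied to $a(X)\sigma(X)$ and $X/\sigma(X)$ gives

\[ I_{\theta_0}(a) \le E\!\left[\frac{X^2}{\sigma^2(X)}\right], \]

with equality for $a^*(X) = X/\sigma^2(X)$ or any nonzero scalar multiple of it. In this special case the supremum over instruments is therefore attained and equals $E[X^2/\sigma^2(X)]$.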

A further generalization, which can also be written in the form (1), is given by a
sequential (nested) moment restrictions model, in which the $\sigma$-fields generated by the
conditioning vectors satisfy the condition $\sigma(X^{(1)}) \subset \sigma(X^{(2)}) \subset \dots \subset \sigma(X^{(J)})$. For the
expression of the semiparametric efficiency bound in the sequential case, see Chamberlain
(1992b) and Ai and Chen (2009); see also Hahn (1997) and Ahn and Schmidt (1999)
and references therein for examples of applications. It turns out that once again the
information on $\theta_0$ can be obtained by taking the supremum of the information on $\theta_0$ in
the following unconditional models:

\[ E[a_j(X^{(j)})\, g_j(Z, \theta)] = 0, \quad j = 1, \dots, J, \]

where the number of rows of the matrices $a_j$ is fixed and equal to the dimension of $\theta$,
and the supremum is attained for a suitable choice $a_1^*(X^{(1)}), \dots, a_J^*(X^{(J)})$ of optimal
instruments. The reason why this happens in the case with nested $\sigma$-fields is
that the model of interest can be written as the decreasing limit of a sequence of models
for which a so-called 'spanning condition', similar to the one considered in Newey (2004),
holds and the limit of the corresponding efficient scores has an explicit solution.

In this paper we show that the information on $\theta_0$ in model (1) can be obtained as the
limit of the information on $\theta_0$ in a decreasing sequence of unconditional moment models
of the form

\[ E[a_j^{(k)}(X^{(j)})\, g_j(Z, \theta)] = 0, \quad j = 1, \dots, J, \quad k = 1, 2, \dots \tag{2} \]

where the number of rows of the matrices $a_j^{(k)}$ increases to infinity with $k$. To the best
of our knowledge this result is new. It provides theoretical support for a natural solution that
could be used in practice: replace the model (1) by a large number of unconditional
moment conditions like (2) in order to approach efficiency. Herein we also propose an
alternative route for approximating the efficiency bound. More precisely, we give a general
method to approximate the efficient score, which in most situations does not have
an explicit form as in the aforementioned examples. In particular, our general approach
for approximating the efficient score sheds new light on the functional equations used
to characterize the efficient score in the regression model with unobserved explanatory
variables in Robins, Rotnitzky and Zhao (1994); see also Tsiatis (2006) and Tan (2011). To
summarize, our theoretical results complete and extend existing results in the literature,
provide new insight into the theory of semiparametric efficiency bounds and open
the door to new applications, in particular in missing data contexts.
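
The idea of approaching the conditional bound with a growing number of unconditional moments can be sketched numerically. The example below is only an illustration under assumptions that are not in the paper: a scalar model $E[Y - \theta X \mid X] = 0$, a hypothetical design with $X$ uniform on a grid over $[-2, 2]$, $\mathrm{Var}(Y \mid X) = 1 + X^2$, and polynomial instruments $X, X^2, \dots, X^k$. For each $k$ it evaluates the standard information $D'V^{-1}D$ of the $k$ unconditional moments and compares it with $E[X^2/\sigma^2(X)]$, the bound of the conditional model in this design.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def mean(values):
    return sum(values) / len(values)


# Hypothetical design (an assumption of this sketch): X uniform on a symmetric
# grid over [-2, 2] and Var(Y | X) = 1 + X^2.
N = 400
XS = [-2.0 + 4.0 * (i + 0.5) / N for i in range(N)]


def sigma2(x):
    return 1.0 + x * x


def information(k):
    """Information D' V^{-1} D for the k moments E[X^s (Y - theta X)] = 0, s = 1..k."""
    # D_s = E[X^s * X]; the sign of the derivative cancels in the quadratic form.
    d = [mean([x ** s * x for x in XS]) for s in range(1, k + 1)]
    # V_st = E[X^s X^t sigma^2(X)], the covariance matrix of the moment functions.
    v = [[mean([x ** (s + t) * sigma2(x) for x in XS]) for t in range(1, k + 1)]
         for s in range(1, k + 1)]
    return sum(di * ci for di, ci in zip(d, solve(v, d)))


# Bound of the conditional model E[Y - theta X | X] = 0 in this design.
BOUND = mean([x * x / sigma2(x) for x in XS])
INFOS = [information(k) for k in range(1, 7)]

for k, info in enumerate(INFOS, start=1):
    print(f"k={k}  I_k={info:.6f}  bound={BOUND:.6f}")
```

In this toy design the printed values are nondecreasing in $k$ and stay below the bound, because $D'V^{-1}D$ is the squared norm of the projection of $X/\sigma^2(X)$ on the span of the instruments. The polynomial instruments are a convenience choice for the sketch, not a recommendation of the paper.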

The paper is organized as follows. Section 2 contains our main results. We show
that under a suitable 'spanning condition' on the tangent spaces, the semiparametric
Fisher information in model (1) can be obtained as the limit of the efficiency bounds for
a decreasing sequence of models. In Section 3 we propose a 'backfitting' procedure for
computing the projection of the score on the tangent space of the model. With an
approximation of the efficient score at hand, we suggest a general method for constructing
asymptotically efficient estimators. In Section 4 we illustrate the utility of
our theoretical results for two large classes of models: sequential (nested) conditional
models and regression-like models with missing data. The technical assumptions required
for our results and some technical proofs are relegated to the Appendix.
2 The main results



Let us introduce some notation and definitions, see also van der Vaart (1998), sections
25.2 and 25.3. Given a sample space $\mathcal{Z}$ and a probability $P$ on the sample space, we
denote by $L^2(P)$ the usual Hilbert space of measurable real-valued functions that are
square-integrable with respect to $P$. For $\mathcal{H}$ a Hilbert space and $S \subset \mathcal{H}$ let $\overline{S}$ denote the
closure of $S$ in $\mathcal{H}$. Moreover, if $S \subset \mathcal{H}$ is a linear subspace and $h \in \mathcal{H}$, let $\Pi(h \mid S)$ be the
projection of $h$ on $S$. The statistical models on the sample space $\mathcal{Z}$ are denoted by $\mathcal{P}$, $\mathcal{P}_1$,
$\mathcal{P}_2, \dots$ A statistical model is a collection of probability measures defined by their densities
with respect to some fixed dominating measure on the sample space. For a model $\mathcal{P}$ (resp.
$\mathcal{P}_j$) and a probability measure $P$ in the model, let $\dot{\mathcal{P}}_P$ (resp. $\dot{\mathcal{P}}_{j,P}$) denote the tangent
cone of the model $\mathcal{P}$ (resp. $\mathcal{P}_j$) at $P$. When there is no possible confusion, we simply write
$\dot{\mathcal{P}}$ (resp. $\dot{\mathcal{P}}_j$). Let $\mathcal{T}(\mathcal{P}, P)$ denote the tangent space of a model $\mathcal{P}$ at some probability
measure $P \in \mathcal{P}$, that is, the closure of the linear span of the tangent set $\dot{\mathcal{P}}_P$. By
definition, both the tangent cone and the tangent space are subsets of $L^2(P)$. Herein the
vectors are column matrices and $A \in \mathbb{R}^{r} \times \mathbb{R}^{s}$ means $A$ is an $r \times s$ matrix with random
elements, if not stated differently. For $A \in \mathbb{R}^{r} \times \mathbb{R}^{r}$, $E(A)$ denotes the expectation of $A$
and $E^{-1}(A)$ denotes the inverse of the square matrix $E(A)$. Finally, for a square matrix
$A$, let $A^{-}$ denote a generalized inverse, for instance the Moore-Penrose pseudoinverse.

2.1 A general lemma 

The following result is a generalization of Theorem 1 in Newey (2004), where only the case
of conditioning vectors $X^{(j)}$, $j = 1, \dots, J$, that generate the same $\sigma$-field is considered.
The proof of our result is postponed to the Appendix.

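Lemma 1 Let $P_0 \in \mathcal{P} \subset \mathcal{P}_1$ be the true law of the vector $Z \in \mathcal{Z}$ and $\theta_0 = \psi(P_0)$ for a
map $\psi : \mathcal{P}_1 \to \mathbb{R}^{d}$ differentiable at $P_0$ relative to the tangent cone $\dot{\mathcal{P}}_{1,P_0}$. Let $\{\mathcal{P}_k\}_{k \in \mathbb{N}^*}$ be
a decreasing family of statistical models such that

\[ \mathcal{P}_1 \supset \mathcal{P}_2 \supset \dots \supset \mathcal{P}_k \supset \mathcal{P}_{k+1} \supset \dots \supset \bigcap_{k=1}^{\infty} \mathcal{P}_k \supset \mathcal{P} \ni P_0 \tag{3} \]

and

\[ \bigcap_{k=1}^{\infty} \mathcal{T}_k = \mathcal{T}, \tag{4} \]

where $\mathcal{T} = \mathcal{T}(\mathcal{P}, P_0)$ and $\mathcal{T}_k = \mathcal{T}(\mathcal{P}_k, P_0)$, $k \in \mathbb{N}^*$. Then

\[ I_{\theta_0}(\mathcal{P}) = \lim_{k \to \infty} I_{\theta_0}(\mathcal{P}_k), \]

where $I_{\theta_0}(\mathcal{P})$ stands for the Fisher information on $\theta_0 = \psi(P_0)$ in the model $\mathcal{P}$.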
For the definition of the Fisher information $I_{\theta_0}(\mathcal{P})$ on $\theta_0 = \psi(P_0)$ in the model $\mathcal{P}$
we refer to Bickel, Klaassen, Ritov and Wellner (1993) or van der Vaart (1998); see also
Newey (1990). When the models $\mathcal{P}_k$, $k \in \mathbb{N}^*$, are defined by an increasing number of
moment conditions with the same conditioning vectors, condition (4) is exactly the
so-called spanning condition of Newey (2004).

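Remark 1 Even if $\bigcap_{k=1}^{\infty} \mathcal{P}_k = \mathcal{P}$, condition (4) is not necessarily fulfilled. To see this,
consider $f_0$ a symmetric density on the real line and let $s_1$, $s_2$ be two odd functions such
that $|s_1|, |s_2| < 1$ (e.g. $s_l(x) = x^{2l-1}\, \mathbf{1}_{\{|x| < 1\}}$, $l = 1, 2$). For any $k \in \mathbb{N}^*$ and $t \in [-1, 1]$,
define

\[ f_t(x) = f_0(x)\,[1 + t\, s_2(x)], \qquad f_{t,k}(x) = k f_0(kx)\,[1 + t\, s_1(x)] \]

and consider the following models defined by their densities with respect to $\lambda$, the Lebesgue
measure on the real line: $\mathcal{Q}_k = \{ f_{t,k} \cdot \lambda : t \in [-1, 1] \}$, $k \in \mathbb{N}^*$, and

\[ \mathcal{P} = \{ f_t \cdot \lambda : t \in [-1, 1] \}, \qquad \mathcal{P}_k = \mathcal{P} \cup \bigcup_{m=k}^{\infty} \mathcal{Q}_m, \quad k \in \mathbb{N}^*. \]

Then we have

\[ \mathcal{P}_1 \supset \mathcal{P}_2 \supset \dots \supset \mathcal{P}_k \supset \mathcal{P}_{k+1} \supset \dots \supset \bigcap_{k=1}^{\infty} \mathcal{P}_k = \mathcal{P}. \]

To describe the corresponding tangent spaces, notice that

\[ \forall k \ge 1, \quad \partial_t \log f_{t,k}(x)\big|_{t=0} = s_1(x) \quad \text{and} \quad \partial_t \log f_t(x)\big|_{t=0} = s_2(x), \]

and thus

\[ \dot{\mathcal{P}} = \{ a\, s_2(x) : a \in \mathbb{R} \}, \qquad \dot{\mathcal{P}}_k = \{ a\, s_2(x) : a \in \mathbb{R} \} \cup \{ b\, s_1(x) : b \in \mathbb{R} \}, \quad k \in \mathbb{N}^*. \]

Then

\[ \mathcal{T} = \{ a\, s_2(x) : a \in \mathbb{R} \}, \qquad \mathcal{T}_k = \{ a\, s_2(x) + b\, s_1(x) : a, b \in \mathbb{R} \}, \quad k \in \mathbb{N}^*. \]

This shows that

\[ \bigcap_{k=1}^{\infty} \dot{\mathcal{P}}_k \ne \dot{\mathcal{P}} \quad \text{and} \quad \bigcap_{k=1}^{\infty} \mathcal{T}_k \supsetneq \mathcal{T}, \]

even if the decreasing sequence of models $\{\mathcal{P}_k\}_{k \in \mathbb{N}^*}$ is such that $\bigcap_{k=1}^{\infty} \mathcal{P}_k = \mathcal{P}$.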
2.2 Efficiency bound

The main idea we follow to derive the semiparametric efficiency bound for the parameter
$\theta_0$ is to transform the finite number of conditional moment restrictions (1) into a countable
number of unconditional (marginal) moment restrictions. Next, for any finite subset of
these unconditional moment restrictions, one can easily obtain the Fisher information
bound. Eventually, one may expect to obtain the semiparametric efficiency bound for the
model (1) as the limit of the efficiency bounds for a decreasing sequence of models defined
by an increasing sequence of finite subsets of unconditional moment restrictions. Remark 1
shows that in general this intuition is not correct. However, Lemma 1 states that this
intuition becomes correct under the additional condition (4).

Let us introduce some more notation. If $\zeta : \mathcal{Z} \times \Theta \to \mathbb{R}^{m}$, $m \ge 1$, is some given
function of $Z$ and $\theta$ and $X$ is some subvector of $Z$, we denote

\[ E[\partial_{\theta'} \zeta \mid X] = E[\partial_{\theta'} \zeta(Z, \theta_0) \mid X] = \frac{\partial}{\partial \theta'} E[\zeta(Z, \theta) \mid X] \Big|_{\theta = \theta_0} \in \mathbb{R}^{m} \times \mathbb{R}^{d}, \tag{5} \]

when such derivatives of $\theta \mapsto E[\zeta(Z, \theta) \mid X]$ exist. A similar notation will be used with the
conditional expectation $E(\cdot \mid X)$ replaced by the marginal (unconditional) expectation
with respect to the law of $Z$. Let us point out that the maps $\theta \mapsto \zeta(z, \theta)$ may not be
everywhere differentiable. Next, let us define

\[ g = (g_1', \dots, g_J')' \in \mathbb{R}^{p} = \mathbb{R}^{p_1 + \dots + p_J}, \]

and let $\underline{X}$ denote the vector of all components of $Z$ contained in the subvectors $X^{(j)}$,
$j = 1, \dots, J$.

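For the purpose of transforming conditional moments into unconditional versions,
consider a countable set of square-integrable functions $\mathcal{W} = \{ w_k : k \in \mathbb{N}^* \} \subset L^2(P_0)$ such
that $\overline{\mathrm{lin}\, \mathcal{W}} = L^2(P_0)$, that is, the linear span of $\mathcal{W}$ is dense in $L^2(P_0)$. For any $s \in \mathbb{N}^*$,
define a $p \times p$ diagonal matrix

\[ w_s(\underline{X}) = \mathrm{diag}\Big( \underbrace{E[w_s(Z) \mid X^{(1)}], \dots, E[w_s(Z) \mid X^{(1)}]}_{p_1 \text{ times}}, \dots, \underbrace{E[w_s(Z) \mid X^{(J)}], \dots, E[w_s(Z) \mid X^{(J)}]}_{p_J \text{ times}} \Big). \]

Next, for any $k \in \mathbb{N}^*$, let

\[ w^{(k)}(\underline{X}) = (w_1(\underline{X}), \dots, w_k(\underline{X}))' \in \mathbb{R}^{kp} \times \mathbb{R}^{p} \quad \text{and} \quad g_w^{(k)}(Z, \theta) = w^{(k)}(\underline{X})\, g(Z, \theta) \in \mathbb{R}^{kp}. \]

Moreover, let $I_{\theta_0}^{(k)}$ be the Fisher information on $\theta_0$ in the model

\[ E[g_w^{(k)}(Z, \theta)] = 0, \tag{6} \]

that is

\[ I_{\theta_0}^{(k)} = E\big[\partial_{\theta}\, g_w^{(k)\prime}(Z, \theta_0)\big]\; V^{-}\big[g_w^{(k)}(Z, \theta_0)\big]\; E\big[\partial_{\theta'}\, g_w^{(k)}(Z, \theta_0)\big]. \]

See Chamberlain (1987), Newey (2001), see also Chen and Pouzo (2009) for the
non-smooth case.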

We can state now the main result of the paper. 

Theorem 1 Under Assumptions T and SP in the Appendix, the information bound
$I_{\theta_0}$ on $\theta_0$ at $P_0$ in model (1) is given by

\[ I_{\theta_0} = \lim_{k \to \infty} I_{\theta_0}^{(k)}, \]

where, for any $k \in \mathbb{N}^*$, $I_{\theta_0}^{(k)}$ is the Fisher information on $\theta_0$ in the model defined as in (6).

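Proof. For any $k \in \mathbb{N}^*$, let $\mathcal{P}_k$ be the model defined by equation (6) and $\mathcal{P}$ the model
defined by equation (1). Then

\[ \mathcal{P}_1 \supset \mathcal{P}_2 \supset \dots \supset \mathcal{P}_k \supset \mathcal{P}_{k+1} \supset \dots \supset \bigcap_{k=1}^{\infty} \mathcal{P}_k = \mathcal{P}. \]

Hence the stated result is a direct consequence of Lemma 1 provided that condition (4)
holds for the tangent spaces of $\mathcal{P}$ and $\mathcal{P}_k$, $k \in \mathbb{N}^*$, at $\theta_0$.

For each $j \in \{1, \dots, J\}$, any $z \in \mathcal{Z} \subset \mathbb{R}^{q}$ can be partitioned into two subvectors
$y^{(j)} \in \mathbb{R}^{q - q_j}$ and $x^{(j)} \in \mathbb{R}^{q_j}$ with $x^{(j)}$ in the support of $X^{(j)}$. Let $P_{X^{(j)}}$ denote the law of
$X^{(j)}$. Model $\mathcal{P}$ is then defined by the set of conditions

\[ \int g_j(z, \theta)\, f(z, \theta)\, dy^{(j)} = 0, \quad P_{X^{(j)}}\text{-a.s.}, \quad j \in \{1, \dots, J\}; \tag{7} \]

for a fixed $k$, the model $\mathcal{P}_k$ is defined by

\[ \int g_j(z, \theta)\, f(z, \theta)\, w_s^{j}(x^{(j)})\, dz = 0, \quad j \in \{1, \dots, J\}, \quad s \in \{1, \dots, k\}, \tag{8} \]

where

\[ w_s^{j}(x^{(j)}) = E[w_s(Z) \mid X^{(j)} = x^{(j)}]. \]

Consider now a regular parametric family $\{f_t\}_{t \in (-\varepsilon, \varepsilon)}$ of densities satisfying (7), which means
that there exist parameters $\theta_t \in \Theta$ such that, for any $t \in (-\varepsilon, \varepsilon)$ and $P_{X^{(j)}}$-a.s.,

\[ \int g_j(z, \theta_t)\, f_t(z, \theta_t)\, dy^{(j)} = 0, \quad \forall j \in \{1, \dots, J\}. \tag{9} \]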
Let

\[ s = \partial_t \log f_t(Z, \theta_0)\big|_{t=0}, \qquad S_{\theta_0} = \partial_\theta \log f(Z, \theta)\big|_{\theta = \theta_0}, \]

\[ s_1 = \partial_t \log f_t(Z, \theta_t)\big|_{t=0} = s + S_{\theta_0}'\, \dot{\theta}, \]

with $\dot{\theta} = \partial_t \theta_t |_{t=0}$. Here and in the following, the derivatives of the log-densities are to be understood in the
mean square sense, see Ibragimov and Has'minskii (1981), page 64. Differentiating with
respect to $t$ in (9) we obtain

\[ E[\partial_{\theta'} g_j(Z, \theta_0) \mid X^{(j)}]\, \dot{\theta} + E[g_j(Z, \theta_0)\, s_1(Z) \mid X^{(j)}] = 0, \quad \forall j \in \{1, \dots, J\}. \tag{10} \]

Since $\dot{\theta} \in \mathbb{R}^{d}$ could be arbitrary, we deduce that for each $j \in \{1, \dots, J\}$,

\[ E[\partial_{\theta'} g_j(Z, \theta_0) \mid X^{(j)}] + E[g_j(Z, \theta_0)\, S_{\theta_0}'(Z) \mid X^{(j)}] = 0, \qquad E[g_j(Z, \theta_0)\, s(Z) \mid X^{(j)}] = 0. \]

The last equation and the expression of the score functions $s_1$ suggest a tangent space
$\mathcal{T} = \mathcal{T}(\mathcal{P}, P_0)$ of the form

\[ \mathcal{T} = \mathrm{lin}\, S_{\theta_0} + \{ s : E(s^2) < \infty,\ E(s) = 0,\ E[g_j(Z, \theta_0)\, s(Z) \mid X^{(j)}] = 0,\ 1 \le j \le J \}. \tag{11} \]

On the other hand, the tangent space $\mathcal{T}_k = \mathcal{T}(\mathcal{P}_k, P_0)$ corresponding to the model defined
by the equations (8) is given by vectors satisfying the unconditional moment equations

\[ E[\partial_{\theta'} g_j(Z, \theta_0)\, w_r^{j}(X^{(j)})]\, \dot{\theta} + E[g_j(Z, \theta_0)\, s_1(Z)\, w_r^{j}(X^{(j)})] = 0, \tag{12} \]

$1 \le j \le J$, $1 \le r \le k$. This yields the tangent spaces

\[ \mathcal{T}_k = \mathrm{lin}\, S_{\theta_0} + \{ s : E(s^2) < \infty,\ E(s) = 0,\ E[g_j(Z, \theta_0)\, s(Z)\, w_r^{j}(X^{(j)})] = 0,\ \forall\, 1 \le j \le J,\ \forall\, 1 \le r \le k \}; \]

see for instance Example 3, section 3.2 in Bickel, Klaassen, Ritov and Wellner (1993).
Since the functions $w_k(Z)$, $k \in \mathbb{N}^*$, span $L^2(P_0)$, their projections $w_k^{j}(X^{(j)})$ on $L^2(P_{X^{(j)}})$,
$k \in \mathbb{N}^*$, will span $L^2(P_{X^{(j)}})$. Consequently, equations (10) are satisfied if and only if
equations (12) are satisfied for any $k \in \mathbb{N}^*$. In other words, the equivalent of the spanning
condition of Newey (2004), see our equation (4) above, is satisfied and we can apply
Lemma 1 to conclude that $I_{\theta_0} = \lim_{k \to \infty} I_{\theta_0}^{(k)}$.

The proof will be complete if we show that the tangent space $\mathcal{T} = \mathcal{T}(\mathcal{P}, P_0)$ is indeed
the set described in equation (11). Consider for simplicity that $J = 2$; the general case
can be handled similarly. It is quite easy to see that equations (10) guarantee the
inclusion "$\subset$" in display (11). To show the reverse inclusion, it suffices to prove that
$\mathcal{T}' \subset \mathcal{T}$, where

\[ \mathcal{T}' = \mathcal{T}'(\mathcal{P}, P_0) = \{ s : E(s^2) < \infty,\ E(s) = 0,\ E[g_1(Z, \theta_0)\, s(Z) \mid X^{(1)}] = 0,\ E[g_2(Z, \theta_0)\, s(Z) \mid X^{(2)}] = 0 \}. \]

Let $f_0$ denote the true density of the vector $Z$. Take $s \in \mathcal{T}'$ and suppose for the moment
that $s$ is bounded. Then, for real numbers $t$ with sufficiently small absolute values, the
functions $f_t = (1 + t \cdot s) f_0$ are densities on $\mathcal{Z}$ and if $E_{f_t}$ denotes expectation with respect
to the law defined by $f_t$,

\[ E_{f_t}[g_j(Z, \theta_0)\, a(X^{(j)})] = E[g_j(Z, \theta_0)\, a(X^{(j)})] + t\, E[g_j(Z, \theta_0)\, s(Z)\, a(X^{(j)})] = 0, \]

for any square-integrable function $a(X^{(j)})$, so that $E_{f_t}[g_j(Z, \theta_0) \mid X^{(j)}] = 0$, $j = 1, 2$.
Moreover,

\[ \partial_t \log f_t \big|_{t=0} = \partial_t \log(1 + t \cdot s)\big|_{t=0} = s, \]

which means that the family of densities $\{f_t\}_{|t| < \varepsilon}$ defines a submodel of model (1) for
which the tangent vector at $t = 0$ is exactly $s$. Next, we have to extend the argument
to unbounded functions $s$. If $\mathcal{M} \subset L^2(P_0)$ is the subspace of bounded functions of $Z$, it
remains to show that $\mathcal{M} \cap \mathcal{T}'$ is dense in $\mathcal{T}'$. One may consider this step obvious since
any unbounded square-integrable function can be approximated by a sequence of bounded
functions, see for instance Ai and Chen (2003), page 1838. We argue that this well-known
approximation result cannot be directly applied to our context, as is also the case in
other contexts considered in the efficiency bounds literature. Indeed, here we are in the
following situation: we have two infinite-dimensional closed subspaces $\mathcal{T}_1'$ and $\mathcal{T}_2'$ such that
$\mathcal{T}' = \mathcal{T}_1' \cap \mathcal{T}_2'$, $\overline{\mathcal{M} \cap \mathcal{T}_1'} = \mathcal{T}_1'$ and $\overline{\mathcal{M} \cap \mathcal{T}_2'} = \mathcal{T}_2'$, and we need that $\overline{\mathcal{M} \cap \mathcal{T}'} = \mathcal{T}'$. To the
best of our knowledge, there is no general mathematical result which would allow us to claim
that $\mathcal{M} \cap \mathcal{T}'$ is dense in $\mathcal{T}'$ without any further argument. That is why we have to provide
a proof adapted to the case we consider herein. By Assumption T and the subsequent
remark, and equation (28), there exist two bounded vector functions $b_1$ and $b_2$ defined
like in equation (28) such that, for $i, j \in \{1, 2\}$, $i \ne j$,

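\[ E(g_i\, b_j' \mid X^{(i)}, X^{(j)}) = 0 \quad \text{and} \quad \big\| E^{-1}(g_i\, b_i' \mid X^{(i)}) \big\|_{\infty} < \infty, \]

where $g_i = g_i(Z, \theta_0)$. Here and in the sequel, the norm of a vector (or matrix) should be
understood as the sum of componentwise norms. Since $\mathcal{M}$ is dense in $L^2(P_0)$, for a fixed
$s \in \mathcal{T}'$ there exists a sequence $\{t_n\}_n \subset \mathcal{M}$ such that

\[ \| s - t_n \|_{L^2(P_0)} \to 0 \quad \text{as } n \to \infty. \]

Define

\[ u_n = t_n - E(t_n g_1' \mid X^{(1)})\, E^{-1}(b_1 g_1' \mid X^{(1)})\, b_1 - E(t_n g_2' \mid X^{(2)})\, E^{-1}(b_2 g_2' \mid X^{(2)})\, b_2. \]

It is clear that we can take $\{t_n\}_n \subset \mathcal{M}$ such that

\[ \| E(t_n g_1 \mid X^{(1)}) \|_{\infty} + \| E(t_n g_2 \mid X^{(2)}) \|_{\infty} < \infty \]

and thus $u_n \in \mathcal{M}$. Then

\begin{align*}
E(g_1 u_n' \mid X^{(1)}) &= E(g_1 t_n' \mid X^{(1)}) - E(g_1 b_1' \mid X^{(1)})\, E^{-1}(g_1 b_1' \mid X^{(1)})\, E(g_1 t_n' \mid X^{(1)}) \\
&\quad - E\big[ g_1 b_2'\, E^{-1}(g_2 b_2' \mid X^{(2)})\, E(g_2 t_n' \mid X^{(2)}) \mid X^{(1)} \big] \\
&= - E\big[ \underbrace{E(g_1 b_2' \mid X^{(1)}, X^{(2)})}_{=0}\, E^{-1}(g_2 b_2' \mid X^{(2)})\, E(g_2 t_n' \mid X^{(2)}) \mid X^{(1)} \big] \\
&= 0, \tag{13}
\end{align*}

and similarly,

\[ E(g_2 u_n' \mid X^{(2)}) = 0. \tag{14} \]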

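Moreover, since $E(s\, g_i' \mid X^{(i)}) = 0$ for $s \in \mathcal{T}'$,

\begin{align*}
s - u_n &= s - t_n + t_n - u_n \\
&= s - t_n + E[(t_n - s) g_1' \mid X^{(1)}]\, E^{-1}(b_1 g_1' \mid X^{(1)})\, b_1 \\
&\quad + E[(t_n - s) g_2' \mid X^{(2)}]\, E^{-1}(b_2 g_2' \mid X^{(2)})\, b_2,
\end{align*}

which entails

\begin{align*}
\| s - u_n \|_{L^2(P_0)} &\le \| s - t_n \|_{L^2(P_0)} + \big\| E[(t_n - s) g_1' \mid X^{(1)}] \big\|_{L^2(P_0)} \cdot \big\| E^{-1}(b_1 g_1' \mid X^{(1)})\, b_1 \big\|_{\infty} \\
&\quad + \big\| E[(t_n - s) g_2' \mid X^{(2)}] \big\|_{L^2(P_0)} \cdot \big\| E^{-1}(b_2 g_2' \mid X^{(2)})\, b_2 \big\|_{\infty}.
\end{align*}

Noting that

\begin{align*}
\big\| E[(t_n - s) g_1' \mid X^{(1)}] \big\|^2_{L^2(P_0)} &= E\big\{ \big\| E[(t_n - s) g_1' \mid X^{(1)}] \big\|^2 \big\} \\
\text{(Cauchy-Schwarz)} \quad &\le E\big\{ E[(t_n - s)^2 \mid X^{(1)}]\; E(\| g_1 \|^2 \mid X^{(1)}) \big\} \\
&\le \big\| E(\| g_1 \|^2 \mid X^{(1)}) \big\|_{\infty}\; E\big\{ E[(t_n - s)^2 \mid X^{(1)}] \big\} \\
&= \big\| E(\| g_1 \|^2 \mid X^{(1)}) \big\|_{\infty}\; \| t_n - s \|^2_{L^2(P_0)},
\end{align*}

and similarly for the term involving $g_2$, we finally obtain $\| s - u_n \|_{L^2(P_0)} \to 0$ as $n \to \infty$. In particular, we deduce that $E(u_n) \to 0$.
Now, since all the previous equations and inequalities involving $u_n$ hold also with
$u_n$ replaced by $u_n - E(u_n)$, we deduce that $\{u_n - E(u_n)\}_n \subset \mathcal{M} \cap \mathcal{T}'$, which implies that
$s \in \overline{\mathcal{M} \cap \mathcal{T}'}$. Now the proof is complete. ■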

In the general theory of efficiency bounds, the semiparametric Fisher information on
a finite-dimensional parameter in a semiparametric model is the infimum of the Fisher
information over all its parametric submodels; see for instance Newey (1990). For models
defined by conditional moment equations, Theorem 1 shows that the same semiparametric
Fisher information can be alternatively obtained as the lower limit of the semiparametric
Fisher information in a sequence of decreasing supra-models. The main reason for this
is that with such a decreasing sequence of supra-models, the 'spanning condition' (4) holds
true. Moreover, since $L^2(P_0)$ is a separable Hilbert space, Theorem 1 can be restated
in the following equivalent form.

Corollary 1 Under the conditions of Theorem 1,

\[ I_{\theta_0} = \sup_{b \in \mathcal{B}} I_{\theta_0}(b), \]

where

\[ \mathcal{B} = \big\{ (b_1(X^{(1)}), \dots, b_J(X^{(J)})) : b_{j,l,k} \in L^2(P_{X^{(j)}}),\ 1 \le l \le d,\ 1 \le k \le p_j,\ 1 \le j \le J \big\}, \]

so that any $b = b(\underline{X}) \in \mathcal{B}$ is a $d \times p$ matrix with random elements, and $I_{\theta_0}(b)$ is the
Fisher information on $\theta_0$ in the model defined by the marginal moment restrictions

\[ E[b_j(X^{(j)})\, g_j(Z, \theta)] = 0, \quad j \in \{1, \dots, J\}, \tag{15} \]

a model which can also be written in the compact form $E[b(\underline{X})\, g(Z, \theta)] = 0$.



Remark 2 We argue that, under further assumptions, the result of Theorem 1 extends
to the case where the functions $g_j$ depend also on a same unknown function $h$
of the observations and the parameter. More precisely, consider the model defined by

\[ E[g_j(Z, \theta, h(Z, \theta)) \mid X^{(j)}] = 0, \quad j = 1, \dots, J, \tag{16} \]

where $g_j : \mathbb{R}^{q} \times \mathbb{R}^{d} \times \mathbb{R}^{p_h} \to \mathbb{R}^{p_j}$, $j \in \{1, \dots, J\}$, are known. With the same notation as used
for defining $g_w^{(k)}$, let

\[ g_w^{(k)}(Z, \theta, h(Z, \theta)) = w^{(k)}(\underline{X})\, g(Z, \theta, h(Z, \theta)) \in \mathbb{R}^{kp}, \quad \forall k \in \mathbb{N}^*, \]

where $g = (g_1', \dots, g_J')'$, and let $I_{\theta_0}^{(k)}$ be the Fisher information on $\theta_0$ in the model

\[ E\big[ g_w^{(k)}(Z, \theta, h(Z, \theta)) \big] = 0; \tag{17} \]

its expression as a solution of a variational problem can be found in Chamberlain (1992),
Ai and Chen (2003) or Chen and Pouzo (2009).

Similar but more involved arguments can be invoked to show the following result, which
we state here as a conjecture: the information $I_{\theta_0}$ on $\theta_0$ at $P_0$ in model (16) is given by

\[ I_{\theta_0} = \lim_{k \to \infty} I_{\theta_0}^{(k)}, \]

where $I_{\theta_0}^{(k)}$ is the Fisher information on $\theta_0$ in model (17).
3 Efficient estimation



To simplify the presentation, let us take $J = 2$. To obtain an efficient estimator, a
common way is to solve for $\theta$ in the efficient score equations; see van der Vaart (1998),
section 25.8. By definition, the efficient score is the componentwise projection of the score
$S_{\theta_0}$ on the orthogonal complement of the tangent space $\mathcal{T} = \mathcal{T}(\mathcal{P}, P_0)$ defined in equation
(11). In the projection of $S_{\theta_0}$ on $\mathcal{T}^{\perp}$ only the nonparametric part of the tangent space
matters. Moreover, the projection of $S_{\theta_0}$ is componentwise. It is then common practice
in the literature to identify $\mathcal{T}$ with the subspace of $\{L^2(P_0)\}^{d} = \bigoplus_{k=1}^{d} L^2(P_0)$ obtained
as the $d$-fold cartesian product of the nonparametric part of $\mathcal{T}$. Here the direct sum of
Hilbert spaces is considered with the usual inner product $\langle (\phi_1, \dots, \phi_d), (\psi_1, \dots, \psi_d) \rangle =
\langle \phi_1, \psi_1 \rangle + \dots + \langle \phi_d, \psi_d \rangle$. Therefore we will slightly change our notation for the tangent
spaces. More precisely, let us define

\[ \mathcal{T} = \Big\{ s \in \bigoplus_{k=1}^{d} L^2(P_0) : E(s) = 0,\ E(g_i(Z, \theta_0)\, s'(Z) \mid X^{(i)}) = 0,\ i = 1, 2 \Big\} = \mathcal{T}_1 \cap \mathcal{T}_2, \]

where, for $i = 1, 2$,

\[ \mathcal{T}_i = \Big\{ s \in \bigoplus_{k=1}^{d} L^2(P_0) : E(s) = 0,\ E(g_i(Z, \theta_0)\, s'(Z) \mid X^{(i)}) = 0 \Big\}, \]

so that

\[ \mathcal{T}_i^{\perp} = \Big\{ s \in \bigoplus_{k=1}^{d} L^2(P_0) : s = a_i(X^{(i)})\, g_i(Z, \theta_0) \Big\}. \]

Clearly, $\mathcal{T}^{\perp} = \overline{\mathcal{T}_1^{\perp} + \mathcal{T}_2^{\perp}}$.

In general, the projection of $S_{\theta_0}$ on $\mathcal{T}^{\perp}$ is not explicit. To approximate this projection
and to further build an asymptotically efficient estimator for model (1), we use the iterative
("backfitting" or successive approximation) procedure considered in Theorem A.4.2 of
Bickel, Klaassen, Ritov and Wellner (1993), page 438; BKRW hereafter. Let $H_i = \mathcal{T}_i^{\perp}$,
$g_i = g_i(Z, \theta_0)$, $i = 1, 2$, and let $E(\partial_\theta g_i')$ be the transpose of the matrix $E(\partial_{\theta'} g_i)$ defined in
equation (5). The steps of the procedure we propose are the following:

1. Set $m = 0$. Take $a_2^{(0)} = 0$.

2. Put $m = m + 1$. Calculate

\[ \bar{S}_{\theta_0}^{(m)} = a_1^{(m)}(X^{(1)})\, g_1 + a_2^{(m)}(X^{(2)})\, g_2, \]

where

\[ a_1^{(m)}(X^{(1)}) = a_1^{(m)}(X^{(1)}, \theta_0) = -E(\partial_\theta g_1' \mid X^{(1)})\, V^{-}(g_1 \mid X^{(1)}) - E\big[ a_2^{(m-1)}(X^{(2)})\, g_2\, g_1' \mid X^{(1)} \big]\, V^{-}(g_1 \mid X^{(1)}) \]

and

\[ a_2^{(m)}(X^{(2)}) = a_2^{(m)}(X^{(2)}, \theta_0) = -E(\partial_\theta g_2' \mid X^{(2)})\, V^{-}(g_2 \mid X^{(2)}) - E\big[ a_1^{(m)}(X^{(1)})\, g_1\, g_2' \mid X^{(2)} \big]\, V^{-}(g_2 \mid X^{(2)}). \]

3. Repeat from step 2 until convergence of $\bar{S}_{\theta_0}^{(m)}$.

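Let $\Pi(s \mid S)$ denote the (componentwise) projection of a vector $s \in \bigoplus_{k=1}^{d} L^2(P_0)$ on a
subspace $S \subset \bigoplus_{k=1}^{d} L^2(P_0)$. Theorem A.4.2 (A) from BKRW directly yields the following
result.

Lemma 2 Assume that the conditions of Theorem 1 hold true. When $m \to \infty$,

\[ \bar{S}_{\theta_0}^{(m)} = a_1^{(m)}(X^{(1)})\, g_1 + a_2^{(m)}(X^{(2)})\, g_2 \longrightarrow \bar{S}_{\theta_0} = \Pi(S_{\theta_0} \mid \mathcal{T}^{\perp}) = \Pi\big(S_{\theta_0} \mid \overline{H_1 + H_2}\big) \]

in $\bigoplus_{k=1}^{d} L^2(P_0)$, where $g_i = g_i(Z, \theta_0)$, $i = 1, 2$.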
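
The mechanics of such a successive approximation scheme can be sketched in a finite-dimensional toy setting. The code below is an illustration only, under assumptions that are not in the paper: the Hilbert space is $\mathbb{R}^4$ with the usual inner product, $H_1$ and $H_2$ are the lines spanned by two hypothetical vectors `U1` and `U2`, and the "score" is a hypothetical vector `S`. Each sweep projects the partial residual on $H_1$, then on $H_2$, and the sum of the two components is compared with the projection of `S` on $H_1 + H_2$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_on_line(s, u):
    """Orthogonal projection of s on the one-dimensional subspace spanned by u."""
    c = dot(s, u) / dot(u, u)
    return [c * a for a in u]


def backfit(s, u1, u2, sweeps=60):
    """Alternate projections of partial residuals on span{u1} and span{u2}."""
    h1 = [0.0] * len(s)
    h2 = [0.0] * len(s)
    for _ in range(sweeps):
        # Update the H1 component given the current H2 component, then the reverse.
        h1 = project_on_line([a - b for a, b in zip(s, h2)], u1)
        h2 = project_on_line([a - b for a, b in zip(s, h1)], u2)
    return h1, h2


# Hypothetical inputs chosen for this sketch.
S = [1.0, 2.0, 3.0, 4.0]
U1 = [1.0, 1.0, 0.0, 0.0]
U2 = [0.0, 1.0, 1.0, 0.0]

H1, H2 = backfit(S, U1, U2)
FITTED = [a + b for a, b in zip(H1, H2)]
RESIDUAL = [a - b for a, b in zip(S, FITTED)]
print(FITTED)
```

For these inputs the direct projection of `S` on the plane spanned by `U1` and `U2` is $(1/3, 8/3, 7/3, 0)$, obtained by solving the $2 \times 2$ normal equations, and the iterations approach it geometrically. In the setting of the paper the projections are conditional-expectation operators rather than dot products, so this sketch only conveys the structure of the iteration.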

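Let us point out that even if Lemma 2 guarantees the convergence of the iterations
$\bar{S}_{\theta_0}^{(m)}$, it is not necessarily true that the sequences $a_1^{(m)}(X^{(1)})\, g_1$ and $a_2^{(m)}(X^{(2)})\, g_2$
converge. Sufficient mild conditions are provided in Theorem A.4.2 (C) of BKRW, namely

\[ \bar{S}_{\theta_0} = \Pi(S_{\theta_0} \mid \mathcal{T}^{\perp}) = a_1^*(X^{(1)}) \cdot g_1 + a_2^*(X^{(2)}) \cdot g_2 \in \mathcal{T}_1^{\perp} + \mathcal{T}_2^{\perp} \tag{18} \]

with $a_i^*(X^{(i)}) \cdot g_i \in \mathcal{T}_i^{\perp} \cap (\mathcal{T}_1^{\perp} \cap \mathcal{T}_2^{\perp})^{\perp} \subset \mathcal{T}_i^{\perp}$. Moreover, by Proposition A.4.1 of BKRW,
condition (18) is equivalent to the existence of a solution $a_1^* g_1$ and $a_2^* g_2$ for the system

\begin{align*}
a_1^*(X^{(1)})\, g_1 &= p_1 - E\big[ a_2^*(X^{(2)})\, g_2\, g_1' \mid X^{(1)} \big]\, V^{-}(g_1 \mid X^{(1)})\, g_1 \\
a_2^*(X^{(2)})\, g_2 &= p_2 - E\big[ a_1^*(X^{(1)})\, g_1\, g_2' \mid X^{(2)} \big]\, V^{-}(g_2 \mid X^{(2)})\, g_2, \tag{19}
\end{align*}

where, for $i = 1, 2$,

\begin{align*}
p_i = p_i(Z, \theta_0) := \Pi(S_{\theta_0} \mid \mathcal{T}_i^{\perp}) &= E(S_{\theta_0}\, g_i' \mid X^{(i)})\, V^{-}(g_i \mid X^{(i)})\, g_i \\
&= -E(\partial_\theta g_i' \mid X^{(i)})\, V^{-}(g_i \mid X^{(i)})\, g_i.
\end{align*}

(A careful inspection of the proof of Proposition A.4.1 of BKRW shows that the condition that
$H_1 + H_2 = \mathcal{T}_1^{\perp} + \mathcal{T}_2^{\perp}$ is a closed subspace is not necessary for deriving that result, since
what is really used in their proof is the relation $H_1^{\perp} \cap H_2^{\perp} = (H_1 + H_2)^{\perp}$.) If in addition
the system (19) has a unique solution, the backfitting algorithm above is nothing but a
convergent iterative procedure for finding it.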

In applications, a convenient way to check uniqueness is to prove a contraction
property. This is the case for instance if $\mathcal{T}_1^{\perp} \cap \mathcal{T}_2^{\perp} = \{0\}$, which in our framework holds
if

\[ E(g_1\, g_2' \mid X^{(1)}, X^{(2)}) = 0 \]

(in the sequential case, this can be achieved by writing the initial system in an equivalent
form satisfying the orthogonality condition above; see subsection 4.1).

In the general case where $\mathcal{T}_1^{\perp} \cap \mathcal{T}_2^{\perp} \ne \{0\}$ the system (19), rewritten as in Proposition
A.4.1 of BKRW in the form

\[ h_1 = \Pi(S_{\theta_0} \mid H_1) - \Pi(h_2 \mid H_1), \qquad h_2 = \Pi(S_{\theta_0} \mid H_2) - \Pi(h_1 \mid H_2), \]

does not necessarily have the contraction property. In our problem $h_1 = a_1 g_1$ and $h_2 =
a_2 g_2$ with $g_1$ and $g_2$ given. Hence it suffices to check a contraction property for $a_1 g_1$ and
$a_2 g_2$ or some given transformations of them. We will see in subsection 4.2 that in the
regression-like models with missing data framework, see Robins, Rotnitzky and Zhao (1994),
the equations (19) lead to a contraction property for some given transformations of $a_1 g_1$
and $a_2 g_2$.

The "backfitting" algorithm we proposed above involves $\theta_0$, which is unknown. In
practice one can use the following steps: (i) build $\theta_n$, a $\sqrt{n}$-consistent estimator of $\theta_0$, for
instance the smooth minimum distance estimator (SMD) as in Lavergne and Patilea
(2008); (ii) estimate nonparametrically $a_1^{(m^*)}$ and $a_2^{(m^*)}$, the solution of the "backfitting"
algorithm obtained after, say, $m^*$ iterations using $\theta_n$ instead of $\theta_0$; and (iii) construct an
efficient (classical GMM or SMD) estimator $\hat{\theta}^{(m^*)}$ based on the approximate efficient score
equations $E\big( \bar{S}_{\theta}^{(m^*)} \big) = 0$, where

\[ \bar{S}_{\theta}^{(m^*)} = \hat{a}_1^{(m^*)}(X^{(1)})\, g_1(Z, \theta) + \hat{a}_2^{(m^*)}(X^{(2)})\, g_2(Z, \theta) \]

and $\hat{a}_1^{(m^*)}$, $\hat{a}_2^{(m^*)}$ denote the estimates obtained in step (ii).

4 Applications

In this section we illustrate the utility of our theoretical results for two general classes of
models: sequential (nested) conditional models and regression-like models with missing
data. The general results in Sections 2 and 3 above allow us: (a) to complete a
semiparametric efficiency bound result of Chamberlain (1992b); and (b) to generalize the mean
regression with missing data setting of Robins, Rotnitzky and Zhao (1994) and Tan (2011)
to more general moment conditions, which include for example quantile regressions.



4.1 Sequential conditional moments 

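Important cases where equations (19) have an explicit solution are the cases where
$\sigma(X^{(1)}) \subset \sigma(X^{(2)})$ holds true. In the case $J = 2$, the model $E(g_j(Z, \theta) \mid X^{(j)}) = 0$,
$j = 1, 2$, defined in (1) can be equivalently written in the form

\begin{align*}
E(\tilde{g}_1(Z, \theta) \mid X^{(1)}) &= 0 \\
E(g_2(Z, \theta) \mid X^{(2)}) &= 0, \tag{20}
\end{align*}

where

\[ \tilde{g}_1(Z, \theta) = g_1(Z, \theta) - E\big( g_1(Z, \theta_0)\, g_2'(Z, \theta_0) \mid X^{(2)} \big)\, V^{-1}\big( g_2(Z, \theta_0) \mid X^{(2)} \big)\, g_2(Z, \theta). \]

Here we suppose that $V(\tilde{g}_1(Z, \theta_0) \mid X^{(1)})$ and $V(g_2(Z, \theta_0) \mid X^{(2)})$ are invertible and this
guarantees that $\theta_0$ is also identified by the equations (20). Recall that $g_i$ is a short notation
for $g_i(Z, \theta_0)$ and similarly let $\tilde{g}_1$ replace $\tilde{g}_1(Z, \theta_0)$.

Notice that $\tilde{g}_1$ is the residual of the projection of $g_1$ on $g_2$ with respect to $\sigma(X^{(2)})$ and
$E(\tilde{g}_1\, g_2' \mid X^{(2)}) = 0$. Let $\tilde{\mathcal{T}}_1$ be the tangent space of the model defined by the first equation
in (20). By the definition of $\tilde{g}_1$, it is quite clear that the condition $\tilde{\mathcal{T}}_1^{\perp} \cap \mathcal{T}_2^{\perp} = \{0\}$ holds true.
Next, multiplying the $i$th equation in (19) by $g_i'$, taking the conditional expectation given
$X^{(i)}$ and finally multiplying by $V^{-1}(g_i \mid X^{(i)})$, $i = 1, 2$, the system (19) corresponding to
model (20) becomes

\begin{align*}
\tilde{a}_1^*(X^{(1)}) &= -E(\partial_\theta \tilde{g}_1' \mid X^{(1)})\, V^{-1}(\tilde{g}_1 \mid X^{(1)}) - E\big( \tilde{a}_2^*(X^{(2)}) \cdot g_2\, \tilde{g}_1' \mid X^{(1)} \big)\, V^{-1}(\tilde{g}_1 \mid X^{(1)}) \\
\tilde{a}_2^*(X^{(2)}) &= -E(\partial_\theta g_2' \mid X^{(2)})\, V^{-1}(g_2 \mid X^{(2)}) - E\big( \tilde{a}_1^*(X^{(1)}) \cdot \tilde{g}_1\, g_2' \mid X^{(2)} \big)\, V^{-1}(g_2 \mid X^{(2)}). \tag{21}
\end{align*}

Since by definition $E(\tilde{a}_2^*(X^{(2)})\, g_2\, \tilde{g}_1' \mid X^{(1)}) = E[\tilde{a}_2^*(X^{(2)})\, E(g_2\, \tilde{g}_1' \mid X^{(2)}) \mid X^{(1)}] = 0$ and
$E(\tilde{a}_1^*(X^{(1)})\, \tilde{g}_1\, g_2' \mid X^{(2)}) = \tilde{a}_1^*(X^{(1)})\, E(\tilde{g}_1\, g_2' \mid X^{(2)}) = 0$ we obtain

\begin{align*}
\tilde{a}_1^*(X^{(1)}) &= -E(\partial_\theta \tilde{g}_1' \mid X^{(1)})\, V^{-1}(\tilde{g}_1 \mid X^{(1)}) \\
\tilde{a}_2^*(X^{(2)}) &= -E(\partial_\theta g_2' \mid X^{(2)})\, V^{-1}(g_2 \mid X^{(2)}).
\end{align*}

($E(\partial_\theta g_i')$ denotes the transpose of the matrix $E(\partial_{\theta'} g_i)$.) The efficient score $\bar{S}_{\theta_0}$ can then
be written as

\begin{align*}
\bar{S}_{\theta_0} &= \tilde{a}_1^*(\underline{X}) \cdot \tilde{g}_1 + \tilde{a}_2^*(\underline{X}) \cdot g_2 \\
&= -E(\partial_\theta \tilde{g}_1' \mid X^{(1)})\, V^{-1}(\tilde{g}_1 \mid X^{(1)})\, \tilde{g}_1 - E(\partial_\theta g_2' \mid X^{(2)})\, V^{-1}(g_2 \mid X^{(2)})\, g_2. \tag{22}
\end{align*}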
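
The residualisation step behind $\tilde{g}_1$ can be checked on a small numerical example. The sketch below rests on assumptions that are not in the paper: scalar moment functions, a conditioning variable $X^{(2)}$ taking two hypothetical values "a" and "b", and made-up pairs $(g_1, g_2)$ in each cell, chosen so that $g_2$ has mean zero within each cell. Conditional expectations given $X^{(2)}$ are replaced by within-cell averages, and the code verifies that the residual $g_1 - c(X^{(2)})\, g_2$, with $c = E(g_1 g_2 \mid X^{(2)}) / E(g_2^2 \mid X^{(2)})$, is conditionally uncorrelated with $g_2$.

```python
# Hypothetical data: each cell of X2 maps to a list of (g1, g2) pairs.
# Within each cell g2 averages to zero, so E(g2^2 | cell) is its conditional variance.
CELLS = {
    "a": [(1.0, 2.0), (-0.5, -1.0), (0.3, -1.0), (-0.8, 0.0)],
    "b": [(0.2, 1.0), (1.1, -3.0), (-0.7, 2.0)],
}


def residualise(pairs):
    """Return g1 - c * g2 with c = E(g1 g2 | cell) / E(g2^2 | cell)."""
    n = len(pairs)
    cross = sum(g1 * g2 for g1, g2 in pairs) / n
    second = sum(g2 * g2 for _, g2 in pairs) / n
    c = cross / second
    return [g1 - c * g2 for g1, g2 in pairs]


TILDE = {cell: residualise(pairs) for cell, pairs in CELLS.items()}

# Within-cell average of tilde_g1 * g2, the sample analogue of E(tilde_g1 g2 | X2).
CROSS = {
    cell: sum(t * g2 for t, (_, g2) in zip(TILDE[cell], CELLS[cell])) / len(CELLS[cell])
    for cell in CELLS
}
print(CROSS)
```

The within-cell cross moments vanish up to rounding error by construction, which is the sample counterpart of $E(\tilde{g}_1\, g_2' \mid X^{(2)}) = 0$. In the vector case the scalar ratio is replaced by the matrix $E(g_1 g_2' \mid X^{(2)})\, V^{-1}(g_2 \mid X^{(2)})$.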
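In the particular case where $X^{(1)} = X^{(2)} = X$,

\begin{align*}
\bar{S}_{\theta_0} &= \tilde{a}_1^*(X) \cdot \tilde{g}_1 + \tilde{a}_2^*(X) \cdot g_2 \\
&= -E(\partial_\theta \tilde{g}_1' \mid X)\, V^{-1}(\tilde{g}_1 \mid X)\, \tilde{g}_1 - E(\partial_\theta g_2' \mid X)\, V^{-1}(g_2 \mid X)\, g_2 \\
&= -\Big( E(\partial_\theta \tilde{g}_1' \mid X),\ E(\partial_\theta g_2' \mid X) \Big)\, V^{-1}\!\left( \begin{pmatrix} \tilde{g}_1 \\ g_2 \end{pmatrix} \Big|\, X \right) \begin{pmatrix} \tilde{g}_1 \\ g_2 \end{pmatrix} \\
&= -E(\partial_\theta g'\, C(X)' \mid X)\, V^{-1}(C(X)\, g \mid X)\, C(X)\, g \\
&= -E(\partial_\theta g' \mid X)\, V^{-1}(g \mid X)\, g,
\end{align*}

where $g' = (g_1', g_2')$ and

\[ C(X) = \begin{pmatrix} I & -E(g_1 g_2' \mid X)\, V^{-1}(g_2 \mid X) \\ 0 & I \end{pmatrix} \]

is a nonsingular random matrix. This expression of the efficient score directly yields the
efficiency bound derived in Chamberlain (1987).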

Another important particular case of formulae (22) is provided by models defined by
sequential conditional moments; see Chamberlain (1992b), Ai and Chen (2009). Taking
$X^{(1)} = X_1$ and $X^{(2)} = (X_1', X_2')'$, one obtains

\begin{align*}
\bar{S}_{\theta_0} &= \tilde{a}_1^*(\underline{X}) \cdot \tilde{g}_1 + \tilde{a}_2^*(\underline{X}) \cdot g_2 \\
&= -E(\partial_\theta \tilde{g}_1' \mid X_1)\, V^{-1}(\tilde{g}_1 \mid X_1)\, \tilde{g}_1 - E(\partial_\theta g_2' \mid X_1, X_2)\, V^{-1}(g_2 \mid X_1, X_2)\, g_2.
\end{align*}

Let us point out that Chamberlain (1992b) only proves this result for discrete distributions
and Ai and Chen (2009) obtain the result in a more general framework (allowing for
unknown infinite-dimensional parameters in the equations defining the model) but under
slightly more restrictive assumptions than in our setting.$^1$

4.2 Regression-like models with missing data

Consider now a regression-like model defined by the equations

$$
E[\rho(Y, X^*, \alpha) \mid X^*] = 0, \tag{23}
$$

where $\rho(\cdot,\cdot,\cdot)$ is some measurable vector-valued function, $\alpha$ is a (finite-dimensional) vector of parameters, and the vector $(Y', X^{*\prime})' = (Y', X', V')'$ is not always completely observed.

¹ Ai and Chen (2009) implicitly require that the class $\mathcal{Q}$ appearing in their Assumption A in the Mathematical Appendix is the same for each value of their model parameter $\alpha$. This variation independent parametrization assumption represents an additional restriction that is unnecessary in our approach. See also van der Laan and Robins (2003), page 18, for some lucid comments on the existence of a variation independent parametrization.

We also assume that a non-missing indicator $\delta$ and some other variable $V^{\circ}$ are always observed. The following examples present two random missingness mechanisms, considered respectively by Tan (2011) and Robins, Rotnitzky and Zhao (1994).

Example 2 (i) The vector $Y$ is observed iff $\delta = 1$;

(ii) The vector $W$, which contains $X$ and $V$, is always observed and we have

$$
P(\delta = 1 \mid Y, W) = P(\delta = 1 \mid W) = \pi(W). \tag{24}
$$

Example 3 (i) Let $X^* = (X', V')'$, where $X$ is observed iff $\delta = 1$;

(ii) The vector $W$ is always observed and we have

$$
P(\delta = 1 \mid X, W) = P(\delta = 1 \mid W) = \pi(W). \tag{25}
$$

Let $\alpha_0$ be the true value of the parameter identified by the model. The equation (23) and each of (24) or (25) imply

$$
E\left[ \frac{\delta}{\pi(W)}\, \rho(Y, X^*, \alpha_0) \,\Big|\, X^* \right] = 0. \tag{26}
$$



We can consider this equation at the observational level even for missing $X^*$, since for missing values of $X^*$ we have $\delta = 0$, which renders the equation noninformative. Note also that (24) and (25) can be written under the unified form

$$
P(\delta = 1 \mid Y, X^*, W) = \pi(W).
$$

Therefore, at the observational level, with either of the two examples we obtain a model like

$$
E\left[ \frac{\delta}{\pi(W)}\, \rho(Y, X^*, \alpha_0) \,\Big|\, X^* \right] = 0, \qquad
E\left[ \frac{\delta}{\pi(W)} - 1 \,\Big|\, W \right] = 0. \tag{27}
$$



Moreover, as in Graham (2011, footnote 8, page 442), it can be shown that, at the observational level, a model given by equation (23) and either of the missing data mechanisms described in Example 2 or Example 3 is equivalent to the model defined by (27).
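The two inverse-probability-weighted moments can be checked by exact enumeration on a small discrete law. The sketch below is a toy illustration, not taken from the paper: it assumes a scalar covariate $X^* = X \in \{0,1\}$, an error $\varepsilon \in \{-1,+1\}$, a binary auxiliary variable $V^{\circ}$ correlated with $\varepsilon$, $W = (X, V^{\circ})$, a response $Y = \alpha_0 X + \varepsilon$ that is missing when $\delta = 0$, and hypothetical selection probabilities `pi`. It also shows that the unweighted complete-case moment is not zero in this design.

```python
from fractions import Fraction as F
from itertools import product

ALPHA0 = F(2)  # hypothetical true parameter value


def pi(x, v0):
    """Assumed selection probability pi(W), W = (x, v0); depends on v0 only."""
    return F(3, 4) if v0 == 1 else F(1, 4)


def joint_pmf():
    """Yield (prob, x, y, v0, delta) over the finite support of the toy law."""
    for x, eps, v0, delta in product((0, 1), (-1, 1), (0, 1), (0, 1)):
        p_v0 = F(3, 4) if (v0 == 1) == (eps == 1) else F(1, 4)
        p_delta = pi(x, v0) if delta == 1 else 1 - pi(x, v0)
        # X and eps are uniform and independent; delta depends on W only.
        yield F(1, 2) * F(1, 2) * p_v0 * p_delta, x, ALPHA0 * x + eps, v0, delta


def cond_mean(fn, key):
    """Return E[fn | key] as a dict indexed by the conditioning value."""
    num, den = {}, {}
    for p, x, y, v0, delta in joint_pmf():
        k = key(x, v0)
        num[k] = num.get(k, 0) + p * fn(x, y, v0, delta)
        den[k] = den.get(k, 0) + p
    return {k: num[k] / den[k] for k in num}


def rho(x, y):
    return y - ALPHA0 * x


def g1(x, y, v0, delta):
    return delta / pi(x, v0) * rho(x, y)


def g2(x, y, v0, delta):
    return delta / pi(x, v0) - 1


def complete_case(x, y, v0, delta):
    return delta * rho(x, y)


m1 = cond_mean(g1, key=lambda x, v0: x)             # E[g1 | X*]
m2 = cond_mean(g2, key=lambda x, v0: (x, v0))       # E[g2 | W]
cc = cond_mean(complete_case, key=lambda x, v0: x)  # E[delta * rho | X*]
print(m1, m2, cc)
```

Because the selection probability depends on $V^{\circ}$, which is correlated with the error, dropping the weight $1/\pi(W)$ leaves a nonzero conditional mean, whereas both weighted moments vanish exactly.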

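With our notation, $Z$ is the vector built as the union of all the variables contained in $Y$, $X^*$, $W$ and $\delta$, $\theta = \alpha$, $g_1(Z,\theta) = \{\delta/\pi(W)\}\, \rho(Y, X^*, \alpha)$, $g_2(Z,\theta) = \{\delta/\pi(W)\} - 1$, $X^{(1)} = X^*$ and $X^{(2)} = W$. Let $\rho$ be a short for $\rho(Y, X^*, \alpha_0)$. Then the functions $a_1^*$ and $a_2^*$ defining the efficient score are given by the following equations obtained (see also equations (21)) from equations (19):

$$
\begin{aligned}
a_1^*(X^*) = a_1^*(X^{(1)}) &= -E(\partial_\alpha \rho' \mid X^*)\, E^{-1}\!\left( \frac{1}{\pi(W)}\, \rho \rho' \,\Big|\, X^* \right) \\
&\quad + E\left\{ E[a_1^*(X^*)\, \rho \mid W]\, \frac{1 - \pi(W)}{\pi(W)}\, \rho' \,\Big|\, X^* \right\} E^{-1}\!\left( \frac{1}{\pi(W)}\, \rho \rho' \,\Big|\, X^* \right), \\
a_2^*(W) = a_2^*(X^{(2)}) &= -E[a_1^*(X^*)\, \rho \mid W].
\end{aligned}
$$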

In the particular case where $\rho = \rho(Y, X^*, \alpha_0) = Y - g(X^*, \alpha_0)$ and the selection probability $\pi(W)$ is known, these are exactly the equations obtained in Robins, Rotnitzky and Zhao (1994). They showed that for the regression case, the equation for $a_1^*$ corresponds to a contraction (see the proof of their Proposition 4.2). In subsection 5.3 of the Appendix we show that such a contraction property holds for a more general $\rho$. Hence we could include in our framework further interesting examples, e.g. quantile regressions. The contraction property allows one to solve the equations in $a_1^*(X^*)$ and $a_2^*(W)$ by successive approximations.
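As a reading aid, the equation for $a_2^*$ can be checked by hand. The sketch below assumes that the second equation of system (19) has the same form as in (21), namely $a_2^*\, V(g_2 \mid W) = -E(\partial_\theta g_2' \mid W) - E(a_1^* g_1 g_2' \mid W)$, and uses only the unified missingness condition $P(\delta = 1 \mid Y, X^*, W) = \pi(W)$ together with $\delta^2 = \delta$.

```latex
\begin{align*}
V(g_2 \mid W)
  &= E\Big[\Big(\frac{\delta}{\pi(W)} - 1\Big)^{2} \,\Big|\, W\Big]
   = \frac{1}{\pi(W)} - 1 = \frac{1-\pi(W)}{\pi(W)}, \\
E(a_1^* g_1 g_2' \mid W)
  &= E\Big[a_1^*(X^*)\,\rho\,\frac{\delta}{\pi(W)}\Big(\frac{\delta}{\pi(W)} - 1\Big) \,\Big|\, W\Big]
   = \frac{1-\pi(W)}{\pi(W)}\, E[a_1^*(X^*)\,\rho \mid W], \\
\partial_\alpha g_2 &= 0 \quad \text{(since $\pi(W)$ is known)}, \\
\text{hence}\quad a_2^*(W) &= -E[a_1^*(X^*)\,\rho \mid W].
\end{align*}
```

The same two conditional moments, inserted in the first equation of the system, produce the factor $(1-\pi(W))/\pi(W)$ in the equation for $a_1^*$.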

Let us consider the extended framework where the selection probability is known up to an unknown finite-dimensional parameter $\gamma_0$, that is

$$
P(\delta = 1 \mid W) = \pi(W, \gamma_0),
$$

(see also Robins, Rotnitzky and Zhao (1994), equation (18)). In subsection 5.4 of the Appendix we show that the efficient score for $\alpha_0$ has the same expression regardless of whether the selection probability function $\pi$ is given or depends on the unknown parameter $\gamma_0$. Thus, we extend a result of Robins, Rotnitzky and Zhao (1994), see also Tan (2011), obtained in the particular case of mean regressions.

Let us close this section with a remark. Robins, Rotnitzky and Zhao (1994) considered the case where missingness arises only in the covariates $X^*$ (that is also the case considered in our Example 3) and derived the efficient score equations. Tan (2011) obtained formally the same equations with missing regressors and missing responses (the case corresponding to our Example 2) using the corresponding definition of $W$. However, there is an important difference between Examples 2 and 3. In the possibly missing responses case we have $\sigma(X^*) \subset \sigma(W)$, so that Example 2 falls in the sequential conditional moments framework where the solutions for $a_1^*$ and $a_2^*$ are explicit. Such explicit solutions are no longer available in the framework considered by Robins, Rotnitzky and Zhao (1994) and in our Example 3.

5 Appendix

5.1 Additional proofs 

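Proof of Lemma 1. By definition (van der Vaart (1998), p. 363), there exists a continuous linear map $\dot\psi : L^2(P_0) \to \mathbb{R}^d$ such that for any $g \in \dot{\mathcal{P}}_{P_0} \subset L^2(P_0)$ and a submodel $(-\varepsilon, \varepsilon) \ni t \mapsto P_t$ with score function $g$,

$$
\frac{\psi(P_t) - \psi(P_0)}{t} \xrightarrow[t \to 0]{} \dot\psi(g).
$$

By the Riesz representation theorem, there exists a unique $d$-dimensional vector-valued function $\bar\psi$ having its components in $L^2(P_0)$ such that $\dot\psi(h) = E_{P_0}(\bar\psi h)$ for every $h \in L^2(P_0)$. In particular,

$$
\dot\psi(g) = E_{P_0}(\bar\psi g) = \int \bar\psi\, g\, dP_0, \qquad \forall g \in \dot{\mathcal{P}}_{P_0} \subset L^2(P_0).
$$

Let $\tilde\psi$ and $\tilde\psi_k$ denote the elements of $[L^2(P_0)]^d$ obtained by componentwise projections of $\bar\psi$ on the tangent spaces $\mathcal{T} \subset L^2(P_0)$ and $\mathcal{T}_k \subset L^2(P_0)$, respectively. The Fisher information matrices on $\theta_0 = \psi(P_0)$ in the models $\mathcal{P}$, $\mathcal{P}_k$ at $P_0$ are then defined by

$$
I_{\theta_0}^{-1}(\mathcal{P}) = V_{P_0}(\tilde\psi) = E_{P_0}(\tilde\psi \tilde\psi'), \qquad I_{\theta_0}^{-1}(\mathcal{P}_k) = V_{P_0}(\tilde\psi_k), \quad k \in \mathbb{N}^*.
$$

From

$$
\mathcal{P}_1 \supset \mathcal{P}_2 \supset \dots \supset \mathcal{P}_k \supset \mathcal{P}_{k+1} \supset \dots \supset \bigcap_{k=1}^{\infty} \mathcal{P}_k \supset \mathcal{P}
$$

we deduce that

$$
\dot{\mathcal{P}}_1 \supset \dot{\mathcal{P}}_2 \supset \dots \supset \dot{\mathcal{P}}_k \supset \dot{\mathcal{P}}_{k+1} \supset \dots \supset \bigcap_{k=1}^{\infty} \dot{\mathcal{P}}_k \supset \dot{\mathcal{P}},
$$

and

$$
\mathcal{T}_1 \supset \mathcal{T}_2 \supset \dots \supset \mathcal{T}_k \supset \mathcal{T}_{k+1} \supset \dots \supset \bigcap_{k=1}^{\infty} \mathcal{T}_k = \mathcal{T},
$$

where the last equality is due to (3). By Lemma 4.5 of Hansen and Sargent (1991),

$$
\lim_{k \to \infty} I_{\theta_0}^{-1}(\mathcal{P}_k) = \lim_{k \to \infty} V_{P_0}\big(\Pi(\bar\psi \mid \mathcal{T}_k)\big) = V_{P_0}\big(\Pi(\bar\psi \mid \mathcal{T})\big) = V_{P_0}(\tilde\psi) = I_{\theta_0}^{-1}(\mathcal{P}).
$$

5.2 Assumptions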

For a subset $A \subset \operatorname{supp} Z$, we use the following notations: $g_{i,A} = g_i(Z, \theta_0)\, \mathbb{1}\{Z \in A\}$, $i = 1, 2$, and

$$
b_i = g_{i,A} - E(g_{i,A}\, g_{j,A}' \mid X^{(1)}, X^{(2)})\, E^{-1}(g_{j,A}\, g_{j,A}' \mid X^{(1)}, X^{(2)})\, g_{j,A}, \tag{28}
$$

$(i,j) \in \{(1,2), (2,1)\}$, where $E^{-1}(g_{j,A}\, g_{j,A}' \mid X^{(1)}, X^{(2)})$ stands for the inverse of the matrix $E(g_{j,A}\, g_{j,A}' \mid X^{(1)}, X^{(2)})$ that is supposed to exist.

Assumption T

There exists a subset $A \subset \operatorname{supp} Z$ such that for $i = 1, 2$, $g_{i,A}$ is a bounded function and

1. $E(g_{i,A}\, g_{i,A}' \mid X^{(1)}, X^{(2)})$ is invertible and $\| E^{-1}(g_{i,A}\, g_{i,A}' \mid X^{(1)}, X^{(2)}) \|_\infty < \infty$;

2. $\| E^{-1}(b_i b_i' \mid X^{(i)}) \|_\infty < \infty$ with $b_i$ defined in (28).

Remark 3 Under Assumption T and for any $a > 0$, by the definition of $b_i$, for $(i,j) \in \{(1,2), (2,1)\}$,

$$
E(g_i(Z, \theta_0)\, a\, b_j' \mid X^{(1)}, X^{(2)}) = E(g_{i,A}\, a\, b_j' \mid X^{(1)}, X^{(2)}) = 0,
$$

and, for $i = 1, 2$,

$$
E(g_{i,A}\, a\, b_i' \mid X^{(i)}) = a\, E(b_i b_i' \mid X^{(i)}).
$$

Therefore, in the proof of Theorem 1, up to a suitable scaling factor, we can choose $b_1$ and $b_2$ such that conditions ( Iff.jjj) are satisfied.



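Assumption SP

1. The models $\mathcal{P}$ defined by (pQ) and $\mathcal{P}_k$ defined by (jSj), with $k \in \mathbb{N}^*$, can be written in the semiparametric form

$$
\mathcal{P} = \{P_{\theta,\eta} : \theta \in \Theta,\ \eta \in H\}, \qquad \mathcal{P}_k = \{P_{\theta,\eta} : \theta \in \Theta,\ \eta \in H_k\}, \quad k \in \mathbb{N}^*,
$$

and satisfy the assumptions of Lemma 25.25 (page 369) of van der Vaart (1998).

2. The Fisher information matrices $I_{\theta_0}(\mathcal{P})$ and $I_{\theta_0}(\mathcal{P}_k)$ on $\theta_0$ in the models $\mathcal{P}$ and $\mathcal{P}_k$ respectively, for any $k \in \mathbb{N}^*$, are well defined and nonsingular.

To guarantee Assumption SP.2 it suffices to suppose that for any $1 \le j \le J$: (i) $\| V(g_j(Z, \theta_0) \mid X^{(j)}) \|_\infty < \infty$; (ii) the maps $\theta \mapsto E(g_j(Z, \theta) \mid X^{(j)} = x^{(j)})$ are differentiable for $P_{X^{(j)}}$-almost all $x^{(j)}$; and (iii) the information matrix

$$
E\left\{ E\big[(\partial_{\theta'} g_j(Z, \theta_0))' \mid X^{(j)}\big]\, V^{-1}\big[g_j(Z, \theta_0) \mid X^{(j)}\big]\, E\big[\partial_{\theta'} g_j(Z, \theta_0) \mid X^{(j)}\big] \right\}
$$

is nonsingular.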

A consequence of Assumption SP (see Lemma 25.25 of van der Vaart (1998)) is that the parameter defined by $\psi(P_{\theta,\eta}) = \theta$ is differentiable at $P_0 = P_{\theta_0,\eta_0}$ with respect to the tangent space $\mathcal{T} = \mathcal{T}(\mathcal{P}, P_0)$. It also ensures that the tangent space $\mathcal{T}$ can be written as the sum of the finite-dimensional subspace spanned by the components of the parametric score $S_{\theta_0}$ and the tangent space $\mathcal{T}'$ corresponding to the nonparametric part $\mathcal{P}' = \{P_{\theta_0,\eta} : \eta \in H\}$ of the model $\mathcal{P}$:

$$
\mathcal{T} = \operatorname{lin} S_{\theta_0} + \mathcal{T}'.
$$

Note that this assumption does not necessarily mean that the parameters $\theta$ and $\eta$ are completely separated. In fact $\theta$ and $\eta$ are connected since the functional parameter $\eta$ can have $\theta$ among its arguments. Assumption SP only means that when considering the density of $P_{\theta,\eta}$ with respect to a dominating measure $\mu$ we could write it under the form

$$
f(\cdot, \theta, \eta(v(\cdot, \theta))),
$$

with $f$ and $v$ having a known form, where $f(\cdot, \theta_0, \eta(v(\cdot, \theta_0)))$ and $f(\cdot, \theta, \eta_0(v(\cdot, \theta)))$ belong to the model $\mathcal{P}$ for every $\theta \in \Theta$ and $\eta \in H$. For example, in the conditional mean setting with one conditioning vector

$$
E[Y - m(X, \theta) \mid X] = 0,
$$

we can take $H$ as the set of zero conditional mean densities of $Z = (Y', X')'$, i.e.

$$
H = \left\{ p(y,x) \cdot \gamma(x) : p \ge 0,\ \gamma \ge 0,\ \int p(y,x)\, dy = 1,\ \int y\, p(y,x)\, dy = 0,\ \forall x,\ \int \gamma(x)\, dx = 1 \right\}
$$

and $v(y, x, \theta) = (y - m(x, \theta), x)$, so that

$$
\eta(v(z, \theta)) = \eta(y - m(x, \theta), x) = p(y - m(x, \theta), x) \cdot \gamma(x)
$$

and

$$
f(z, \theta, \eta(v(z, \theta))) = \eta(v(z, \theta)).
$$

In the proof of Theorem 1 we identify the density $f(\cdot, \theta, \eta(v(\cdot, \theta)))$ with the infinite-dimensional nuisance parameter $\eta$, which is itself a density.

5.3 Contraction property in regression-like models with missing data



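With the same notation as in subsection 4.2, we shall prove that the equation

$$
a_1^*(X^*) = E\left\{ E[a_1^*(X^*)\, \rho(Z, \theta_0) \mid W]\, \frac{1 - \pi(W)}{\pi(W)}\, \rho'(Z, \theta_0) \,\Big|\, X^* \right\} \times E^{-1}\!\left[ \frac{1}{\pi(W)}\, \rho(Z, \theta_0)\, \rho'(Z, \theta_0) \,\Big|\, X^* \right] \tag{29}
$$

has a unique solution which can be obtained by successive approximation, under the additional assumption

$$
\inf_{w} \pi(w) = 1 - \beta > 0, \tag{30}
$$

the infimum being taken over all possible values of $W$. For simplicity, in the remainder of this subsection we drop the arguments of the functions. Let $\tilde\rho = \pi^{-1/2} \rho$. Assuming that $E(\tilde\rho \tilde\rho' \mid X^*)$ is invertible, equation (29) can be equivalently written under the form

$$
\begin{aligned}
a_1^* \tilde\rho &= E\left[ E(a_1^* \rho \mid W)\, \frac{1 - \pi}{\pi}\, \rho' \,\Big|\, X^* \right] E^{-1}\!\left( \frac{1}{\pi}\, \rho \rho' \,\Big|\, X^* \right) \tilde\rho \\
&= E\left[ E(a_1^* \tilde\rho \mid W)\, (1 - \pi)\, \tilde\rho' \mid X^* \right] E^{-1}(\tilde\rho \tilde\rho' \mid X^*)\, \tilde\rho \\
&=: T(a_1^* \tilde\rho).
\end{aligned}
$$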

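We will show that the map $T$ is a contraction. Before that, let us state a Cauchy-Schwarz inequality for matrix-valued random variables, a version of an inequality in Lavergne (2008): let $E$ denote the conditional expectation given an arbitrary $\sigma$-field, let $A \in \mathbb{R}^n \times \mathbb{R}^p$ and $B \in \mathbb{R}^n \times \mathbb{R}^q$ be random matrices such that $E(\operatorname{tr}(A'A)), E(\operatorname{tr}(B'B)) < \infty$ and $E(A'A)$ is non-singular. Then $E(B'B) - E(B'A)\, E^{-1}(A'A)\, E(A'B)$ is positive semi-definite, with equality iff $B = A\, E^{-1}(A'A)\, E(A'B)$.² We also use the following notation: for any symmetric matrices $B_1, B_2$, $B_1 \gg B_2$ means $B_1 - B_2$ is positive semi-definite.

² Like in Lavergne (2008), let $\Lambda = E^{-1}(A'A)\, E(A'B)$. Then $E[(B - A\Lambda)'(B - A\Lambda)] = E(B'B) - E(B'A)\, E^{-1}(A'A)\, E(A'B)$ is clearly positive semi-definite, and is zero iff $B = A\Lambda$.

Let us write

$$
\begin{aligned}
E[T(a_1^* \tilde\rho)\, T'(a_1^* \tilde\rho)]
&= E\Big\{ E[E(a_1^* \tilde\rho \mid W)(1 - \pi)\, \tilde\rho' \mid X^*]\, E^{-1}(\tilde\rho \tilde\rho' \mid X^*)\, \tilde\rho \\
&\qquad \times \tilde\rho'\, E^{-1}(\tilde\rho \tilde\rho' \mid X^*)\, \big\{ E[E(a_1^* \tilde\rho \mid W)(1 - \pi)\, \tilde\rho' \mid X^*] \big\}' \Big\} \\
&= E\Big\{ E[E(a_1^* \tilde\rho \mid W)(1 - \pi)\, \tilde\rho' \mid X^*]\, E^{-1}(\tilde\rho \tilde\rho' \mid X^*) \\
&\qquad \times \big\{ E[E(a_1^* \tilde\rho \mid W)(1 - \pi)\, \tilde\rho' \mid X^*] \big\}' \Big\} \\
\text{(Cauchy-Schwarz)} \quad &\ll E\big\{ E[E(a_1^* \tilde\rho \mid W)\, (1 - \pi)^2\, E(\tilde\rho'\, a_1^{*\prime} \mid W) \mid X^*] \big\} \\
&= E\big[ E(a_1^* \tilde\rho \mid W)\, (1 - \pi)^2\, E(\tilde\rho'\, a_1^{*\prime} \mid W) \big] \\
\text{(Cauchy-Schwarz)} \quad &\ll E\big[ (1 - \pi)^2\, (a_1^* \tilde\rho)(a_1^* \tilde\rho)' \big].
\end{aligned}
$$

This implies

$$
\| T(a_1^* \tilde\rho) \|_2^2 = E\big\{ \operatorname{tr}[T'(a_1^* \tilde\rho)\, T(a_1^* \tilde\rho)] \big\} = \operatorname{tr}\big\{ E[T(a_1^* \tilde\rho)\, T'(a_1^* \tilde\rho)] \big\} \le \sup_{w}\, [1 - \pi(w)]\, \| a_1^* \tilde\rho \|_2^2 \le \beta\, \| a_1^* \tilde\rho \|_2^2,
$$

where $\beta = \sup_w [1 - \pi(w)] = 1 - \inf_w \pi(w) < 1$ by assumption (30). Deduce that $T$ is a contracting map.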
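The successive approximation scheme can be illustrated on a small discrete example. The sketch below is a toy design, not taken from the paper: it assumes a scalar covariate $X \in \{0,1\}$ that may be missing, an error $\varepsilon \in \{-1,+1\}$ independent of $X$, $Y = 2X + \varepsilon$ with $\alpha_0 = 2$ (so that $\rho = \varepsilon$), $W = Y$, and hypothetical selection probabilities `PI`. It iterates the equation for $a_1^*$ of Section 4.2, that is the constant term $-E(\partial_\alpha \rho \mid X^*)\, E^{-1}(\rho^2/\pi \mid X^*)$ plus the linear map appearing in (29), and records the observed contraction factors in the norm of $a_1 \tilde\rho$. Solving the two linear fixed-point equations by hand for this design gives $a_1^*(0) = -1/14$ and $a_1^*(1) = 13/14$.

```python
import math

# Hypothetical full-data law: X and eps independent, uniform on {0,1} and {-1,+1}.
# Y = 2*X + eps, so rho = Y - alpha0*X equals eps at alpha0 = 2; W = Y is always observed.
SUPPORT = [(0.25, x, eps) for x in (0, 1) for eps in (-1, 1)]
PI = {-1: 0.5, 1: 0.75, 3: 0.5}   # assumed selection probabilities pi(w)
BETA = 1.0 - min(PI.values())     # beta = 1 - inf pi(w), as in (30)


def step(a):
    """Apply one successive-approximation step to a = [a(0), a(1)]."""
    num_w, den_w = {}, {}
    for p, x, eps in SUPPORT:
        w = 2 * x + eps
        num_w[w] = num_w.get(w, 0.0) + p * a[x] * eps
        den_w[w] = den_w.get(w, 0.0) + p
    m = {w: num_w[w] / den_w[w] for w in num_w}   # m(w) = E[a(X) rho | W = w]
    new = []
    for x0 in (0, 1):
        lin = var = px = 0.0
        for p, x, eps in SUPPORT:
            if x == x0:
                w = 2 * x + eps
                px += p
                lin += p * m[w] * (1 - PI[w]) / PI[w] * eps
                var += p * eps * eps / PI[w]
        # -E(d_alpha rho | X = x0) = x0, because d_alpha rho = -X in this design
        new.append((x0 + lin / px) / (var / px))
    return new


def norm(a):
    """L2 norm of a(X) * rho / sqrt(pi(W)) under the full-data law."""
    return math.sqrt(sum(p * (a[x] * eps) ** 2 / PI[2 * x + eps]
                         for p, x, eps in SUPPORT))


def solve(tol=1e-13, max_iter=500):
    """Iterate from a = 0; return the limit and the observed contraction factors."""
    a, prev_gap, ratios = [0.0, 0.0], None, []
    for _ in range(max_iter):
        b = step(a)
        gap = norm([bi - ai for bi, ai in zip(b, a)])
        if prev_gap is not None and prev_gap > 1e-9:
            ratios.append(gap / prev_gap)
        a, prev_gap = b, gap
        if gap < tol:
            break
    return a, ratios


a_star, ratios = solve()
print(a_star, max(ratios), BETA)
```

In this design $W = Y$ does not reveal $X$ when $Y = 1$, so $\sigma(X^*) \not\subset \sigma(W)$ and the solution is not of the explicit sequential form; the iteration nevertheless converges geometrically, with observed factors below $\beta$.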

5.4 Efficient score with parametric selection probability in 
regression-like models with missing data 

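Let $X^{(1)} = X^*$, $X^{(2)} = W$ and the parameter vector $\theta = (\alpha', \gamma')'$. Moreover, let

$$
g_1(Z, \theta) = \frac{\delta}{\pi(W, \gamma)}\, \rho(Y, X^*, \alpha), \qquad g_2(Z, \theta) = \frac{\delta}{\pi(W, \gamma)} - 1,
$$

$$
S_\theta = a_1(X^*)\, g_1(Z, \theta) + a_2(W)\, g_2(Z, \theta),
$$

where

$$
\begin{aligned}
a_1(X^*) = a_1(X^{(1)}) &= -E\big[ E(\pi^{-1}(W, \gamma_0)\, \delta \mid X^*, W)\, \partial_\alpha \rho' \mid X^* \big]\, E^{-1}(\pi^{-1}(W, \gamma_0)\, \rho \rho' \mid X^*) \\
&\quad + E\big\{ E[a_1(X^*)\, \rho \mid W]\, (\pi^{-1}(W, \gamma_0) - 1)\, \rho' \mid X^* \big\}\, E^{-1}(\pi^{-1}(W, \gamma_0)\, \rho \rho' \mid X^*).
\end{aligned}
$$

If we partition $a_1(X^*)$ in

$$
a_1(X^*) = \begin{pmatrix} a_{1,\alpha}(X^*) \\ a_{1,\gamma}(X^*) \end{pmatrix}
$$

and we use the same short notation as previously, the preceding equations can be written as

$$
\begin{aligned}
a_{1,\alpha}(X^*) &= -E\Big( E\Big(\frac{\delta}{\pi} \,\Big|\, X^*, W\Big)\, \partial_\alpha \rho' \,\Big|\, X^* \Big)\, E^{-1}\Big( \frac{1}{\pi}\, \rho \rho' \,\Big|\, X^* \Big) \\
&\quad + E\Big\{ E[a_{1,\alpha}(X^*)\, \rho \mid W] \Big( \frac{1}{\pi} - 1 \Big) \rho' \,\Big|\, X^* \Big\} \times E^{-1}\Big( \frac{1}{\pi}\, \rho \rho' \,\Big|\, X^* \Big), \\
a_{1,\gamma}(X^*) &= E\Big\{ E[a_{1,\gamma}(X^*)\, \rho \mid W] \Big( \frac{1}{\pi} - 1 \Big) \rho' \,\Big|\, X^* \Big\} \times E^{-1}\Big( \frac{1}{\pi}\, \rho \rho' \,\Big|\, X^* \Big),
\end{aligned}
$$

with the obvious solution $a_{1,\gamma} = 0$ for the subvector of $a_1$ corresponding to $\gamma$ (possibly not the unique solution, but any solution yields the same efficient score $S_\theta$). Similar calculations can be done for $a_2(W)$:

$$
a_2(W) = a_2(X^{(2)}) = -E(\partial_\theta g_2 \mid W)\, \frac{\pi}{1 - \pi} - E[a_1(X^*)\, \rho \mid W],
$$

which gives, for $a_2(W) = \begin{pmatrix} a_{2,\alpha}(W) \\ a_{2,\gamma}(W) \end{pmatrix}$,

$$
\begin{aligned}
a_{2,\alpha}(W) &= -E[a_{1,\alpha}(X^*)\, \rho \mid W], \\
a_{2,\gamma}(W) &= \frac{1}{1 - \pi}\, \partial_\gamma \pi - E[a_{1,\gamma}(X^*)\, \rho \mid W] = \frac{1}{1 - \pi}\, \partial_\gamma \pi .
\end{aligned}
$$

Therefore,

$$
S_\theta = \begin{pmatrix} S_\alpha \\ S_\gamma \end{pmatrix} = a_1(X^*)\, g_1 + a_2(W)\, g_2 = \begin{pmatrix} a_{1,\alpha}(X^*)\, g_1 + a_{2,\alpha}(W)\, g_2 \\ a_{2,\gamma}(W)\, g_2 \end{pmatrix},
$$

where

$$
\begin{aligned}
a_{1,\alpha}(X^*) &= -E(\partial_\alpha \rho' \mid X^*)\, E^{-1}\Big( \frac{1}{\pi}\, \rho \rho' \,\Big|\, X^* \Big) + E\Big\{ E[a_{1,\alpha}(X^*)\, \rho \mid W]\, \frac{1 - \pi}{\pi}\, \rho' \,\Big|\, X^* \Big\} \times E^{-1}\Big( \frac{1}{\pi}\, \rho \rho' \,\Big|\, X^* \Big), \\
a_{2,\alpha}(W) &= -E[a_{1,\alpha}(X^*)\, \rho \mid W], \\
a_{2,\gamma}(W) &= \frac{1}{1 - \pi}\, \partial_\gamma \pi = \frac{1}{1 - \pi(W, \gamma_0)}\, \partial_\gamma \pi(W, \gamma_0).
\end{aligned}
$$

Now, for any $s = b(W)\, g_2 = b(W) \Big[ \dfrac{\delta}{\pi(W, \gamma_0)} - 1 \Big] \in \mathcal{T}_2^{\perp}$, we have

$$
\begin{aligned}
E(S_\alpha s' \mid W) &= E\Big( S_\alpha \Big[ \frac{\delta}{\pi(W, \gamma_0)} - 1 \Big] \,\Big|\, W \Big)\, b'(W) \\
&= E\Big\{ \Big[ a_{1,\alpha}(X^*)\, \frac{\delta}{\pi}\, \rho + a_{2,\alpha}(W) \Big( \frac{\delta}{\pi} - 1 \Big) \Big] \Big( \frac{\delta}{\pi} - 1 \Big) \,\Big|\, W \Big\}\, b'(W) \\
&= \big\{ E[a_{1,\alpha}(X^*)\, \rho \mid W] + a_{2,\alpha}(W) \big\} \Big( \frac{1}{\pi} - 1 \Big)\, b'(W) \\
&= \big\{ E[a_{1,\alpha}(X^*)\, \rho \mid W] - E[a_{1,\alpha}(X^*)\, \rho \mid W] \big\} \Big( \frac{1}{\pi} - 1 \Big)\, b'(W) = 0,
\end{aligned}
$$

so that, since $S_\gamma = a_{2,\gamma}(W)\, g_2$, we obtain

$$
E(S_\alpha S_\gamma') = E\big[ E(S_\alpha S_\gamma' \mid W) \big] = 0.
$$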



This means that the efficient score $\tilde S_\alpha$ for $\alpha$, equal to the residual of the (componentwise) projection of $S_\alpha$ on $S_\gamma$, coincides with $S_\alpha$,

$$
\tilde S_\alpha = S_\alpha - E(S_\alpha S_\gamma')\, V^{-1}(S_\gamma)\, S_\gamma = S_\alpha,
$$

and has the same expression, as already noticed in Robins, Rotnitzky and Zhao (1994), as in the case where $\pi(W)$ is completely known:

$$
\tilde S_\alpha = S_\alpha = a_{1,\alpha}(X^*)\, g_1 + a_{2,\alpha}(W)\, g_2 = a_1^*(X^*)\, g_1 + a_2^*(W)\, g_2 .
$$